Abstract

We seek decision rules for prediction-time cost reduction, where complete data is available for training, but at prediction time each feature can only be acquired at an additional cost.

We propose a novel random forest algorithm to minimize prediction error for a user-specified average feature acquisition budget. While random forests yield strong generalization performance, they do not explicitly account for feature costs and furthermore require low correlation among trees, which amplifies costs. Our random forest grows trees with low acquisition cost and high strength based on greedy minimax cost-weighted-impurity splits. Theoretically, we establish near-optimal acquisition cost guarantees for our algorithm. Empirically, on a number of benchmark datasets we demonstrate superior accuracy-cost curves against state-of-the-art prediction-time algorithms.

1. Introduction 

In many applications such as surveillance and retrieval, we acquire measurements for an entity, and features for a query in order to make a prediction. Features can be expensive and complementary, namely, knowledge of previously acquired feature values often renders acquisition of another feature redundant. In these cases, the goal is to maximize prediction performance given a constraint on the average feature acquisition cost. Our proposed approach is to learn decision rules for prediction-time cost reduction (Kanani & Melville, 2008) from training data in which the full set of features and ground truth labels are available.

We propose a novel random forest learning algorithm to minimize prediction error for a user-specified average feature acquisition budget. Random forests (Breiman, 2001) construct a collection of trees, wherein each tree is grown by random independent data sampling and feature splitting, producing a collection of independent identically distributed trees. The resulting classifiers are robust, are easy to train, and yield strong generalization performance.

Although well suited to unconstrained supervised learning problems, applying random forests in the case of prediction-time budget constraints presents two challenges. First, random forests do not account for feature acquisition costs. If two features have similar utility in terms of power to classify examples but have vastly different costs, a random forest is just as likely to select the high cost feature as the low cost alternative. This is obviously undesirable. Second, a key element of random forest performance is the diversity amongst trees (Breiman, 2001). Empirical evidence suggests a strong connection between diversity and performance, and generalization error is bounded not only with respect to the strength of individual trees but also the correlation between trees (Breiman, 2001). High diversity amongst trees constructed without regard for acquisition cost results in trees using a wide range of features, and therefore a high acquisition cost (see Section 3).

Thus, ensuring a low acquisition cost on the forest hinges on growing each tree with high discriminative power and low acquisition cost. To this end, we propose to learn decision trees that incorporate feature acquisition cost. Our random forest grows trees based on greedy minimax cost-weighted-impurity splits. Although the problem of learning decision trees with optimally low cost is computationally intractable, we show that our greedy approach outputs trees whose cost is closely bounded with respect to the optimal cost. Using these low cost trees, we construct random forests with high classification performance and low prediction-time feature acquisition cost.
Abstractly, our algorithm attempts to solve an empirical risk minimization problem subject to a budget constraint. At each step in the algorithm, we add low-cost trees to the random forest to reduce the empirical risk until the budget constraint is met. The resulting random forest adaptively acquires features during prediction time, with features only acquired when used by a split in the tree. In summary, our algorithm is greedy and easy to train. It can not only be parallelized, but also lends itself to distributed databases. Empirically, it does not overfit and has low generalization error. Theoretically, we can characterize the feature acquisition cost for each tree and for the random forest. Empirically, on a number of benchmark datasets we demonstrate superior accuracy-cost curves against state-of-the-art prediction-time algorithms.

Related Work: The problem of learning from full training data for prediction-time cost reduction (MacKay, 1992; Kanani & Melville, 2008) has been extensively studied. One simple structure for incorporating costs into learning is through detection cascades (Viola & Jones, 2001; Zhang & Zhang, 2010; Chen et al., 2012), where cheap features are used to discard examples belonging to the negative class. Different from our approach, these approaches require a fixed order of features to be acquired and do not generalize well to multi-class. Bayesian approaches have been proposed which model the system as a POMDP (Ji & Carin, 2007; Kapoor & Horvitz, 2009; Gao & Koller, 2011); however, they require estimation of the underlying probability distributions. To overcome the need to estimate distributions, reinforcement learning (Karayev et al., 2013; Busa-Fekete et al., 2012; Dulac-Arnold et al., 2011) and imitation learning (He et al., 2012) approaches have also been studied, where the reward or oracle action is predicted; however, these generally require classifiers capable of operating on a wide range of missing feature patterns.

Supervised learning approaches with prediction-time budgets have previously been studied under an empirical risk minimization framework to learn budgeted decision trees (Xu et al., 2013; Kusner et al., 2014; Trapeznikov & Saligrama, 2013; Wang et al., 2014b;a). In this setting, construction of budgeted decision cascades or trees has been proposed by learning complex decision functions at each node and leaf, outputting a tree of classifiers which adaptively select sensors/features to be acquired for each new example. Common to these systems is a decision structure, which is a priori fixed. The entire structure is parameterized by complex decision functions for each node, which are then optimized using various objective functions. In contrast, we build a random forest of trees where each tree is grown greedily so that the global collection of random trees meets the budget constraint.

Construction of simple decision trees with low costs has also been studied for discrete function evaluation problems (Cicalese et al., 2014; Moshkov, 2010; Bellala et al., 2012). Different from our work, these trees operate on discrete data to minimize function evaluations, with no notion of test time prediction or cost.

As for random forests, despite their widespread use in supervised learning, to our knowledge they have not been applied to prediction-time cost reduction.

2. Feature-Budgeted Random Forest 

We first present the general problem of learning under prediction-time budgets, similar to the formulation in (Trapeznikov & Saligrama, 2013; Wang et al., 2014b). Suppose example/label pairs $(x, y)$ are distributed as $(x, y) \sim H$. The goal is to learn a classifier $f$ from a family of functions $\mathcal{F}$ that minimizes expected loss subject to a budget constraint:

$$\min_{f \in \mathcal{F}} E_{xy}[L(y, f(x))], \quad \text{s.t. } E_x[C(f, x)] \le B, \qquad (1)$$

where $L(y, \hat{y})$ is a loss function, $C(f, x)$ is the cost of evaluating the function $f$ on example $x$ and $B$ is a user-specified budget constraint. In this paper, we assume that the feature acquisition cost $C(f, x)$ is a modular function of the support of the features used by function $f$ on example $x$, that is, acquiring each feature has a fixed constant cost. Without the cost constraint, the problem is equivalent to a supervised learning problem; adding the cost constraint, however, makes this a combinatorial problem (Xu et al., 2013). In practice, we are not given the distribution but instead are given a set of training data $(x_1, y_1), \ldots, (x_n, y_n)$ drawn IID with $(x_i, y_i) \sim H$. We can then minimize the empirical loss subject to a budget constraint:

$$\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i)), \quad \text{s.t. } \frac{1}{n} \sum_{i=1}^{n} C(f, x_i) \le B. \qquad (2)$$

In our context the classifier $f$ is a random forest, $\mathcal{T}$, consisting of $K$ random trees, $D_1, D_2, \ldots, D_K$, that are learnt on training data. Consequently, the expected cost for an instance $x$ during prediction-time can be bounded as follows:

$$E_f[E_x[C(f, x)]] \le \sum_{j=1}^{K} E_{D_j}[E_x[C(D_j, x)]] \qquad (3)$$

where, on the RHS, we are averaging with respect to the random trees. As the trees in a random forest are identically distributed, the RHS scales with the number of trees. This upper bound captures the typical behavior of a random forest due to the low feature correlation among trees.

As a result of this observation, the problem of learning a budgeted random forest can be viewed as equivalent to the problem of finding decision trees with low expected evaluation cost and error. This motivates our algorithm BudgetRF, where greedily constructed decision trees with provably low feature acquisition cost are added until the budget constraint is met according to validation data. The returned random forest is a feasible solution to (1) with strong empirical performance.

2.1. Our Algorithm 

During Training: As shown in Algorithm 1, there are seven inputs to BudgetRF: impurity function $F$, prediction-time feature acquisition budget $B$, a cost vector $C \in \mathbb{R}^m$ that contains the acquisition cost of each feature, training class labels $y_{tr}$ and data matrix $X_{tr} \in \mathbb{R}^{n \times m}$, where $n$ is the number of samples and $m$ is the number of features, and validation class labels $y_{tv}$ and data matrix $X_{tv}$. Note that the impurity function $F$ needs to be admissible, which essentially means monotone and supermodular. We defer the formal definition and theoretical results to Section 2.2. For now it is helpful to think of an impurity function $F$ as measuring the heterogeneity of a set of examples. Intuitively, $F$ is large for a set of examples with mostly different labels and small for a set with mostly the same label.


Algorithm 1 BudgetRF

 1: procedure BudgetRF($F, B, C, y_{tr}, X_{tr}, y_{tv}, X_{tv}$)
 2:   $\mathcal{T} \leftarrow \emptyset$.
 3:   while average cost of $\mathcal{T}$ on the validation set $\le B$ do
 4:     Randomly sample $n$ training data with replacement to form $X^{(i)}$ and $y^{(i)}$.
 5:     Train $T \leftarrow$ GreedyTree($F, C, y^{(i)}, X^{(i)}$).
 6:     $\mathcal{T} \leftarrow \mathcal{T} \cup T$.
 7:   return $\mathcal{T} \setminus T$.

Subroutine - GreedyTree

 8: procedure GreedyTree($F, C, y, X$)
 9:   $S \leftarrow (y, X)$   ▷ the current set of examples
10:   if $F(S) = 0$ then return
11:   for each feature $t = 1$ to $m$ do
12:     Compute $R(t) := \min_{g_t \in \mathcal{G}_t} \max_{i \in \text{outcomes}} \frac{c(t)}{F(S) - F(S^i_{g_t})}$,   ▷ risk for feature $t$
13:     where $S^i_{g_t}$ is the set of examples in $S$ that has outcome $i$ using classifier $g_t$ with feature $t$.
14:   $\hat{t} \leftarrow \arg\min_t R(t)$
15:   $\hat{g} \leftarrow \arg\min_{g_{\hat{t}} \in \mathcal{G}_{\hat{t}}} \max_{i \in \text{outcomes}} \frac{c(\hat{t})}{F(S) - F(S^i_{g_{\hat{t}}})}$
16:   Make a node using feature $\hat{t}$ and classifier $\hat{g}$.
17:   for each outcome $i$ of $\hat{g}$ do
18:     GreedyTree($F, C, y^i_{\hat{g}}, X^i_{\hat{g}}$) to append as child nodes.


BudgetRF iteratively builds decision trees by calling GreedyTree as a subroutine on a sampled subset of examples from the training data until the budget $B$ is exceeded as evaluated using the validation data. The ensemble of trees is then returned as output. As shown in subroutine GreedyTree, the tree building process is greedy and recursive. If the given set of examples has zero impurity as measured by $F$, it is returned as a leaf node. Otherwise, the risk $R(t)$ is computed for each feature $t$, which involves searching for a classifier $g_t$ among the family of classifiers $\mathcal{G}_t$ that minimizes the maximum impurity among its outcomes. Intuitively, a feature with the least $R(t)$ can uniformly reduce the impurity among all its child nodes the most with the least cost. Therefore such a feature is chosen along with the corresponding classifier. The set of examples is then partitioned by this classifier into child nodes, at which GreedyTree is recursively applied. Note that we allow the algorithm to reuse the same feature for the same example in GreedyTree.
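
The training procedure above can be sketched in Python as follows. This is an illustrative sketch under assumptions that are not taken from the paper: stumps are searched exhaustively over midpoints of observed values (rather than sampled at random), the impurity is threshold-Pairs summed over unordered class pairs, trees are plain dictionaries, a guard returns a leaf when no stump reduces impurity, and `max_trees` is an added safety cap. All function and parameter names are hypothetical.

```python
import math
import random
from collections import Counter

def pairs_impurity(labels, alpha=0.0):
    """Threshold-Pairs impurity, summed over unordered class pairs (assumed convention)."""
    n = [max(c - alpha, 0.0) for c in Counter(labels).values()]
    return sum(max(n[i] * n[j] - alpha ** 2, 0.0)
               for i in range(len(n)) for j in range(i + 1, len(n)))

def best_stump(X, y, rows, t, cost, alpha):
    """Return (risk, threshold) for feature t: min over stumps of the max over outcomes."""
    parent = pairs_impurity([y[r] for r in rows], alpha)
    values = sorted({X[r][t] for r in rows})
    best = (math.inf, None)
    for lo, hi in zip(values, values[1:]):          # assumption: try every midpoint
        thr = (lo + hi) / 2.0
        sides = ([r for r in rows if X[r][t] <= thr], [r for r in rows if X[r][t] > thr])
        worst = max(pairs_impurity([y[r] for r in s], alpha) for s in sides)
        gain = parent - worst                       # smallest impurity reduction over outcomes
        risk = cost / gain if gain > 0 else math.inf
        if risk < best[0]:
            best = (risk, thr)
    return best

def greedy_tree(X, y, rows, costs, alpha=0.0):
    labels = [y[r] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if pairs_impurity(labels, alpha) == 0:
        return {"leaf": majority}
    risk, thr, t = min((best_stump(X, y, rows, t, costs[t], alpha) + (t,)
                        for t in range(len(costs))), key=lambda r: r[0])
    if math.isinf(risk):                            # guard (assumption): no stump helps
        return {"leaf": majority}
    left = [r for r in rows if X[r][t] <= thr]
    right = [r for r in rows if X[r][t] > thr]
    return {"feature": t, "thr": thr,
            "left": greedy_tree(X, y, left, costs, alpha),
            "right": greedy_tree(X, y, right, costs, alpha)}

def route(tree, x):
    """Send x down the tree; return (leaf label, set of features acquired)."""
    used = set()
    while "leaf" not in tree:
        used.add(tree["feature"])
        tree = tree["left"] if x[tree["feature"]] <= tree["thr"] else tree["right"]
    return tree["leaf"], used

def average_cost(forest, X_val, costs):
    """Mean acquisition cost per validation example; each feature is paid for once."""
    total = 0.0
    for x in X_val:
        used = set().union(*(route(tree, x)[1] for tree in forest))
        total += sum(costs[t] for t in used)
    return total / len(X_val)

def budget_rf(X, y, X_val, costs, budget, alpha=0.0, max_trees=50, seed=0):
    rng = random.Random(seed)
    forest = []
    while len(forest) < max_trees:                  # max_trees is an added safety cap
        rows = [rng.randrange(len(X)) for _ in range(len(X))]   # bootstrap sample
        forest.append(greedy_tree(X, y, rows, costs, alpha))
        if average_cost(forest, X_val, costs) > budget:
            forest.pop()                            # drop the tree that broke the budget
            break
    return forest
```

In this sketch the forest-level cost of an example is the cost of the union of features used across trees, so a feature shared by several trees is charged once.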

During Prediction: Given a test example and a decision forest $\mathcal{T}$ returned by BudgetRF, we run the example through each tree in $\mathcal{T}$ and obtain a predicted label from each tree. The final predicted label is simply the majority vote among all the trees.
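
The voting step can be illustrated with a few lines of Python. The function name is hypothetical, and the tie-breaking rule (the label seen first among the tied ones wins) is an assumption of this sketch, since the text does not specify one.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most frequent label among the per-tree predictions."""
    # Counter.most_common orders equal counts by first occurrence,
    # which serves as the (assumed) tie-breaking rule here.
    return Counter(tree_predictions).most_common(1)[0][0]
```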

Different from random forest, we incorporate feature acquisition costs in the tree building subroutine GreedyTree with the hope of reducing costs while maintaining low classification error. Our main theoretical contribution is to propose a broad class of admissible impurity functions such that on any given set of $n'$ examples the tree constructed by GreedyTree will have max-cost bounded by $O(\log n')$ times the max-cost of the optimal tree.

2.2. Bounding the Cost of Each Tree 

Given a set of examples $S$ with features and corresponding labels, a classification tree $D$ has a feature-classifier pair associated with each internal node. A test example is routed from the root of $D$ to a leaf node directed by the outcomes of the classifiers along the path; the test example is then labeled to be the majority class among training examples in the leaf node it reaches. The feature acquisition cost of an example $s \in S$ on $D$, denoted as $\mathrm{cost}(D, s)$, is the sum of all feature costs incurred along the root-to-leaf path in $D$ traced by $s$. Note that if $s$ encounters a feature multiple times in the path, the feature cost contributes to $\mathrm{cost}(D, s)$ only once because subsequent use of a feature already acquired for the test example incurs no additional cost. We define the total max-cost as

$$\mathrm{Cost}(D) = \max_{s \in S} \mathrm{cost}(D, s).$$
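
As a small illustration of these two definitions, the sketch below represents each root-to-leaf path simply as the list of feature indices tested along it; this representation and the function names are assumptions made for the example.

```python
def path_cost(features_on_path, costs):
    """cost(D, s): every distinct feature on the path is paid for once."""
    return sum(costs[t] for t in set(features_on_path))

def max_cost(paths, costs):
    """Cost(D): the worst path cost over the examples in S."""
    return max(path_cost(p, costs) for p in paths)
```

For instance, with feature costs `{0: 1.0, 1: 5.0, 2: 2.0}`, a path that tests features 0, 1 and then 0 again costs 6.0 rather than 7.0, because the second use of feature 0 is free.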

We aim to build a decision tree for any given set of examples such that the max-cost is minimized. Note that the max-cost criterion bounds the expected cost criterion of Eq. 3. While this bound could be loose, we show later (see Sec. 2.4) that by parameterizing a suitable class of impurity functions, the max-costs of our GreedyTree solution can be “smoothened” so that it approaches the expected-cost.

First define the following terms: $n'$ is the number of examples input to GreedyTree and $m$ is the number of features, each of which has (a vector of) real values; $F$ is the given impurity function; $F(S)$ is the impurity on the set of examples $S$; $\mathcal{D}_F$ is the family of decision trees with $F(L) = 0$ for any of its leaves $L$; each feature $t$ has a cost $c(t)$; a family of classifiers $\mathcal{G}_t$ is associated with feature $t$; $\mathrm{Cost}_F(S)$ is the max-cost of the tree constructed by GreedyTree using impurity function $F$ on $S$; and assume no feature is used more than once on the same example in the optimal decision tree among $\mathcal{D}_F$ that achieves the minimum max-cost, which we denote as $\mathrm{OPT}(S)$ for the given input set of examples $S$. Note the assumption here is a natural one if the complexity of $\mathcal{G}_t$ is high enough. We show that the max-cost of the tree built by the GreedyTree subroutine is within an $O(\log n')$ factor of the max-cost of the optimal testing strategy if the impurity function $F$ is admissible.

Definition A function $F$ of a set of examples is admissible if it satisfies the following five properties: (1) Non-negativity: $F(G) \ge 0$ for any set of examples $G$; (2) Purity: $F(G) = 0$ if $G$ consists of examples of the same class; (3) Monotonicity: $F(G) \ge F(R), \forall R \subseteq G$; (4) Supermodularity: $F(G \cup j) - F(G) \ge F(R \cup j) - F(R)$ for any $R \subseteq G$ and example $j \notin R$; (5) $\log(F(S)) = O(\log n')$.

Since the set $S$ is always finite, by scaling $F$ we can assume the smallest non-zero impurity of $F$ is 1. Let $\tau$ and $g_\tau$ be the first feature and classifier selected by GreedyTree at the root and let $S^i_{g_\tau}$ be the set of examples in $S$ that has outcome $i$ using classifier $g_\tau$. Note the optimization of the classifier in Line (12) of Algorithm 1 need not be exact. We say GreedyTree is $\lambda$-greedy if $g_\tau$ is chosen such that

$$\max_{i \in \text{outcomes}} \frac{c(\tau)}{F(S) - F(S^i_{g_\tau})} \le \lambda \min_{g_t \in \mathcal{G}_t} \max_{i \in \text{outcomes}} \frac{c(t)}{F(S) - F(S^i_{g_t})}$$

for some constant $\lambda \ge 1$. By definition of max-cost,

$$\frac{\mathrm{Cost}_F(S)}{\mathrm{OPT}(S)} \le \frac{c(\tau) + \max_i \mathrm{Cost}_F(S^i_{g_\tau})}{\mathrm{OPT}(S)},$$

because feature $\tau$ could be selected multiple times by GreedyTree along a path and the feature cost $c(\tau)$ contributes only once to the cost of the path.

Let $q$ be such that $\mathrm{Cost}_F(S^q_{g_\tau}) = \max_i \mathrm{Cost}_F(S^i_{g_\tau})$. We first provide a lemma to lower bound the optimal cost, which will later be used to prove a bound on the cost of the tree.


Lemma 2.1 Let $F$ be monotone and supermodular; let $\tau$ and $g_\tau$ be the first feature and classifier chosen by GreedyTree $\lambda$-greedily on the set of examples $S$, then

$$c(\tau) F(S) / (F(S) - F(S^q_{g_\tau})) \le \lambda \, \mathrm{OPT}(S).$$


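Proof Let $D^* \in \mathcal{D}_F$ be a tree with optimal max-cost. Let $v$ be an arbitrarily chosen internal node in $D^*$, let $\gamma$ be the feature associated with $v$ and $g^*_\gamma$ the corresponding classifier. Let $R \subseteq S$ be the set of examples associated with the leaves of the subtree rooted at $v$. Let $i$ be such that $c(\tau)/(F(S) - F(S^i_{g_\tau}))$ is maximized. Let $g^{\min}_\gamma = \arg\min_{g_\gamma \in \mathcal{G}_\gamma} \max_{i \in \text{outcomes}} \frac{c(\gamma)}{F(S) - F(S^i_{g_\gamma})}$. Let $w$ be such that $c(\gamma)/(F(S) - F(S^w_{g^{\min}_\gamma}))$ is maximized; similarly let $j$ be such that $c(\gamma)/(F(S) - F(S^j_{g^*_\gamma}))$ is maximized. We then have:

$$\frac{c(\tau)}{F(S) - F(S^q_{g_\tau})} \le \frac{c(\tau)}{F(S) - F(S^i_{g_\tau})} \le \frac{\lambda c(\gamma)}{F(S) - F(S^w_{g^{\min}_\gamma})} \le \frac{\lambda c(\gamma)}{F(S) - F(S^j_{g^*_\gamma})} \le \frac{\lambda c(\gamma)}{F(R) - F(R^j_{g^*_\gamma})}. \qquad (4)$$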


The first inequality follows from the definition of $i$. The second inequality follows from the $\lambda$-greedy choice at the root. The third inequality follows from the minimization over classifiers given feature $\gamma$. To show the last inequality, we have to show $F(S) - F(S^j_{g^*_\gamma}) \ge F(R) - F(R^j_{g^*_\gamma})$. This follows from the fact that $S^j_{g^*_\gamma} \cup R \subseteq S$ and $R^j_{g^*_\gamma} = S^j_{g^*_\gamma} \cap R$, and therefore $F(S) \ge F(S^j_{g^*_\gamma} \cup R) \ge F(S^j_{g^*_\gamma}) + F(R) - F(R^j_{g^*_\gamma})$, where the first inequality follows from monotonicity and the second follows from the definition of supermodularity.


For a node $v$, let $S(v)$ be the set of examples associated with the leaves of the subtree rooted at $v$. Let $v_1, v_2, \ldots, v_p$ be a root-to-leaf path on $D^*$ as follows: $v_1$ is the root of the tree, and for each $i = 1, \ldots, p-1$ the node $v_{i+1}$ is a child of $v_i$ associated with the branch $j$ that maximizes $c(t_i)/(F(S) - F(S^j_{g^*_{t_i}}))$, where $t_i$ is the test associated with $v_i$. It follows from (4) that

$$\frac{[F(S(v_i)) - F(S(v_{i+1}))] \, c(\tau)}{\lambda (F(S) - F(S^q_{g_\tau}))} \le c(t_i).$$

Since the cost of the path from $v_1$ to $v_p$ is no larger than the max-cost of $D^*$, we have that

$$\mathrm{OPT}(S) \ge \sum_{i=1}^{p-1} c(t_i) \ge \frac{c(\tau)(F(S) - F(S(v_p)))}{\lambda (F(S) - F(S^q_{g_\tau}))} = \frac{c(\tau) F(S)}{\lambda (F(S) - F(S^q_{g_\tau}))}.$$
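
The last chain of inequalities rests on a telescoping sum, which can be spelled out as follows. Here $\tau$ denotes the root feature chosen by GreedyTree, $\lambda$ the greedy constant and $q$ the worst-cost outcome at the root; the derivation only restates the step above in more detail.

```latex
\begin{align*}
\sum_{i=1}^{p-1} c(t_i)
  &\ge \frac{c(\tau)}{\lambda\,\bigl(F(S) - F(S^{q}_{g_\tau})\bigr)}
       \sum_{i=1}^{p-1} \bigl[F(S(v_i)) - F(S(v_{i+1}))\bigr] \\
  &= \frac{c(\tau)\,\bigl[F(S(v_1)) - F(S(v_p))\bigr]}
          {\lambda\,\bigl(F(S) - F(S^{q}_{g_\tau})\bigr)}
   = \frac{c(\tau)\,F(S)}{\lambda\,\bigl(F(S) - F(S^{q}_{g_\tau})\bigr)} .
\end{align*}
```

The final equality uses $S(v_1) = S$, since $v_1$ is the root, and $F(S(v_p)) = 0$, since $v_p$ is a leaf of a tree in $\mathcal{D}_F$.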
The main theorem of this section is the following.

Theorem 2.2 GreedyTree constructs a decision tree achieving $O(\log n')$-factor approximation of the optimal max-cost in $\mathcal{D}_F$ on the set $S$ of $n'$ examples if $F$ is admissible and no feature is used more than once on any path of the optimal tree.

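Proof This is an inductive proof:

$$\frac{\mathrm{Cost}_F(S)}{\mathrm{OPT}(S)} \le \frac{c(\tau) + \mathrm{Cost}_F(S^q_{g_\tau})}{\mathrm{OPT}(S)} \qquad (6)$$

$$\le \frac{c(\tau)}{\mathrm{OPT}(S)} + \frac{\mathrm{Cost}_F(S^q_{g_\tau})}{\mathrm{OPT}(S^q_{g_\tau})} \qquad (7)$$

$$\le \lambda \frac{F(S) - F(S^q_{g_\tau})}{F(S)} + \frac{\mathrm{Cost}_F(S^q_{g_\tau})}{\mathrm{OPT}(S^q_{g_\tau})} \qquad (8)$$

$$\le \lambda \log\left(\frac{F(S)}{F(S^q_{g_\tau})}\right) + \lambda \log(F(S^q_{g_\tau})) + 1 \qquad (9)$$

$$= \lambda \log(F(S)) + 1 = O(\log(n')). \qquad (10)$$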

The inequality in (7) follows from the fact that $\mathrm{OPT}(S) \ge \mathrm{OPT}(S^q_{g_\tau})$. (8) follows from Lemma 2.1. The first term in (9) follows from the inequality $\frac{x}{1+x} \le \log(1 + x)$ for $x > -1$ and the second term follows from the induction hypothesis that for each $G \subset S$, $\mathrm{Cost}_F(G)/\mathrm{OPT}(G) \le \lambda \log(F(G)) + 1$. If $F(G) = 0$ for some set of examples $G$, we define $\mathrm{Cost}_F(G)/\mathrm{OPT}(G) = 1$.
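
For readers who want the substitution behind step (9) made explicit, the following worked step may help. It assumes $F(S^q_{g_\tau}) > 0$ (so at least 1 after the scaling of $F$); the zero-impurity case is covered by the convention just stated.

```latex
\begin{align*}
x = \frac{F(S)}{F(S^{q}_{g_\tau})} - 1
  \;&\Longrightarrow\;
  \frac{x}{1+x} = \frac{F(S) - F(S^{q}_{g_\tau})}{F(S)}
  \le \log(1+x) = \log\frac{F(S)}{F(S^{q}_{g_\tau})}, \\
\lambda \log\frac{F(S)}{F(S^{q}_{g_\tau})} + \lambda \log F(S^{q}_{g_\tau}) + 1
  &= \lambda \log F(S) + 1 .
\end{align*}
```

By monotonicity $F(S) \ge F(S^q_{g_\tau})$, so $x \ge 0$ and the inequality applies.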

We can verify the base case of the induction as follows. If $F(G) = 1$, which is the smallest non-zero impurity of $F$ on subsets of examples $S$, we claim that the optimal decision tree chooses the feature with the smallest cost among those that can reduce the impurity function $F$:

$$\mathrm{OPT}(G) = \min_{t \,\mid\, \exists g_t \text{ s.t. } F(G^i_{g_t}) = 0, \ \forall i \in \text{outcomes}} c(t).$$

Suppose otherwise, that the optimal tree chooses first a feature $t$ with a child node $G'$ such that $F(G') = 1$ and later chooses another feature $t'$ such that all the child nodes of $G'$ by $g_{t'}$ have zero impurity. Then $t'$ could have been chosen in the first place to reduce all child nodes of $G$ to zero impurity by supermodularity of $F$. On the other hand, $R(t) = \infty$ in GreedyTree for the features that cannot reduce impurity and $R(t) = c(t)$ for those features that can. So the algorithm would pick the feature among those that can reduce impurity and have the smallest cost. Thus, we have shown that $\mathrm{Cost}_F(G)/\mathrm{OPT}(G) = 1 \le \lambda \log(F(G)) + 1$ for the base case.

2.3. Admissible Impurity Functions 

A wide range of functions falls into the class of admissible impurity functions. We employ a particular function called threshold-Pairs in our paper, defined as

$$F_\alpha(G) = \sum_{i \ne j} \left[ [n^i_G - \alpha]_+ [n^j_G - \alpha]_+ - \alpha^2 \right]_+, \qquad (11)$$

where $n^i_G$ denotes the number of objects in $G$ that belong to class $i$, $[x]_+ = \max(x, 0)$ and $\alpha$ is a threshold parameter. We include the proof of the following lemma in the Appendix.

Lemma 2.3 $F_\alpha(G)$ is admissible.
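
As an informal complement to the lemma (not a substitute for the proof in the Appendix), the sketch below implements threshold-Pairs on a vector of per-class counts and brute-force checks purity, monotonicity and supermodularity on small count vectors. It assumes the sum runs over unordered class pairs, represents a set only by its class counts, and models "adding an example" as incrementing one count in both the subset and the superset; the function names are hypothetical.

```python
from itertools import combinations, product

def threshold_pairs(counts, alpha=0):
    """F_alpha for per-class counts n_G^i (unordered class pairs assumed)."""
    shifted = [max(n - alpha, 0) for n in counts]
    return sum(max(a * b - alpha ** 2, 0) for a, b in combinations(shifted, 2))

def spot_check(alpha, num_classes=3, max_count=4):
    """Brute-force check of three admissibility properties on small count vectors."""
    grid = list(product(range(max_count + 1), repeat=num_classes))
    for G in grid:
        if sum(1 for n in G if n > 0) <= 1:
            assert threshold_pairs(G, alpha) == 0            # purity
        for R in grid:
            if any(r > g for r, g in zip(R, G)):
                continue                                      # R is not a subset of G
            assert threshold_pairs(G, alpha) >= threshold_pairs(R, alpha)  # monotonicity
            for c in range(num_classes):                      # add a new example of class c
                G_plus = tuple(n + (i == c) for i, n in enumerate(G))
                R_plus = tuple(n + (i == c) for i, n in enumerate(R))
                gain_G = threshold_pairs(G_plus, alpha) - threshold_pairs(G, alpha)
                gain_R = threshold_pairs(R_plus, alpha) - threshold_pairs(R, alpha)
                assert gain_G >= gain_R                       # supermodularity
    return True
```

Calling `spot_check(0)` or `spot_check(2)` runs the checks for those thresholds; a passing run only shows that no counterexample exists on the small grid examined.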

Neither entropy nor Gini index satisfies the notion of admissibility because they are not monotonic set functions, that is, a subset of examples does not necessarily have a smaller entropy or Gini index compared to the entire set. Traditional decision tree learning algorithms, which rely on these impurities and do not incorporate feature costs, therefore have no guarantee on the max-cost as stated in our paper. We have studied more impurity functions that are admissible, such as the Polynomials and Powers families of functions. After conducting experiments on smaller datasets we noted that they do not offer significant advantage over the threshold-Pairs used in this paper. Please see Appendix for more details.

2.4. Discussions of the Algorithm 

Before concluding the BudgetRF algorithm and its analysis, we discuss further various design issues as well as their implications.

Choice of threshold $\alpha$. In subroutine GreedyTree, each tree is greedily built until a minimum leaf impurity is met, then added to the random forest. The threshold $\alpha$ can be used to trade off between average tree depth and number of trees. A lower $\alpha$ results in deeper trees with higher classification power and acquisition cost. As a result, fewer trees are added to the random forest before the budget constraint is met. Conversely, a higher $\alpha$ yields shallower trees with poorer classification performance; however, due to the low cost of each tree, many are added to the random forest before the budget constraint is met. As such, $\alpha$ can be viewed as a bias-variance trade-off. In practice, it is selected using a validation dataset.

Another observation we make is that the choice of $\alpha$ can potentially lead to different feature choices when used in GreedyTree. To illustrate this point, consider the toy example in Figure 1. A set $G$ has 30 examples in Class 1 (circles) and 30 examples in Class 2 (triangles). Two features $t_1$ and $t_2$ are available to the algorithm at equal cost. Feature $t_1$ has only one classifier in $\mathcal{G}_{t_1}$, as drawn on the upper left of the figure, which can separate 20 examples of Class 2 from the rest of the examples, while $t_2$ has only one classifier in $\mathcal{G}_{t_2}$, as drawn on the lower left of the figure, which evenly divides the examples into halves with equal number of examples from Class 1 and Class 2 in either half. Intuitively, $t_2$ is not a useful feature from a classification point of view because it cannot separate examples based on class at all. This is reflected in the right plot of Figure 1: choosing $t_2$ increases cost but does not reduce classification error while choosing $t_1$ reduces the error to 1/6 (10 of the 60 examples remain misclassified). If $\alpha$ is set to 0 in the threshold-Pairs, feature $t_2$ will be chosen due to the fact that Pairs biases towards feature-classifiers with balanced outcomes. In contrast, setting $\alpha = 8$ leads to feature $t_1$, and therefore may be preferable (see Appendix).

Figure 1. Illustration of different $\alpha$ settings in threshold-Pairs leading to different greedy choices of features. The left two figures show the feature-classifier outcomes of features $t_1$ and $t_2$. The right figure shows the classification error against cost (number of features). Here setting $\alpha = 0$ leads to choosing $t_2$ because it prefers balanced splits; setting $\alpha = 8$ leads to choosing $t_1$, which is better from an error-cost trade-off point of view.
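
The $\alpha = 0$ case of the toy example can be checked numerically. The sketch below uses the class counts stated in the text, a unit cost for both features, and the two-class form of threshold-Pairs with $\alpha = 0$ (the product of the two class counts); the helper names are hypothetical and the leaf error is computed by majority vote.

```python
def pairs(counts):
    """Threshold-Pairs with alpha = 0 for two classes: n1 * n2."""
    n1, n2 = counts
    return n1 * n2

def risk(parent, children, cost=1.0):
    """R(t) for a feature with a single classifier: cost over the smallest impurity reduction."""
    worst = max(pairs(c) for c in children)
    gain = pairs(parent) - worst
    return cost / gain if gain > 0 else float("inf")

def leaf_error(children):
    """Fraction misclassified when each child predicts its majority class."""
    total = sum(sum(c) for c in children)
    return sum(min(c) for c in children) / total

G = (30, 30)                                  # (Class 1, Class 2) counts
t1_children = [(0, 20), (30, 10)]             # t1 splits off 20 Class 2 examples
t2_children = [(15, 15), (15, 15)]            # t2 halves both classes

risk_t1 = risk(G, t1_children)                # 1 / (900 - 300)
risk_t2 = risk(G, t2_children)                # 1 / (900 - 225)
```

Under these assumptions `risk_t2` is smaller than `risk_t1`, so the greedy rule with $\alpha = 0$ prefers the balanced but uninformative split, while `leaf_error(t1_children)` evaluates to 1/6 and `leaf_error(t2_children)` stays at 1/2.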

Minimax-splits. The splitting criterion in the subroutine GreedyTree is based on the worst case impurity among child nodes; we call such splits minimax-splits as opposed to expected-splits, which are based on the expected impurity among child nodes. Using minimax-splits, our theoretical guarantee is a bound on the max-cost of individual trees. Note such minimax-splits have been shown to lead to an expected-cost bound as well in the setting of generalized binary search (GBS) (Nowak, 2008); an interesting future research direction is to show whether minimax-splits can lead to a bound on the expected-cost of individual trees in our setting.

Smoothened Max-Costs. We emphasize that by adjusting $\alpha$ in the threshold-Pairs function, essentially allowing some error, the max-costs of the GreedyTree solution can be “smoothened” so that it approaches the expected-cost. Consider the synthetic multi-class classification example shown in Figure 2. The data set is composed of 1024 examples belonging to 4 classes with 10 binary features available. Assume that no two examples have the same set of feature values. Note that by fixing the acquisition order of the features, the set of feature values maps each example to an integer in the range [0, 1023]. From this mapping, we give the examples in the ranges [1, 255], [257, 511], [513, 767], and [769, 1023] the labels 1, 2, 3, and 4, respectively, and the examples 0, 256, 512, and 768 the labels 2, 3, 4, and 1, respectively (Figure 2 shows the data projected to the first two features). Suppose each feature carries a unit cost. By Kraft’s Inequality (Cover & Thomas, 1991), the optimal max-cost in order to correctly classify every object is 10; however, using only $t_1$ and $t_2$, as selected by the greedy algorithm, leads to a correct classification of all but 4 objects, as shown in Figure 3. Thus, the max-cost of the early stopped tree is only 2, much closer to the expected-cost.

Figure 2. A synthetic example to show max-cost of GreedyTree can be “smoothened” to approach the expected-cost. The left and right figures show the classifier outcomes of feature $t_1$ and $t_2$, respectively.

Figure 3. The error-cost trade-off plot of the subroutine GreedyTree using threshold-Pairs on the synthetic example. 0.39% error can be achieved using only a depth-2 tree but it takes a depth-10 tree to achieve zero error.
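
The 4-error figure for the depth-2 tree can be reproduced with a short script. It assumes that $t_1$ and $t_2$ are the two most significant bits of the integer encoding, so that a depth-2 tree separates the four blocks of 256 consecutive integers and predicts the majority label of each block; the function names are hypothetical.

```python
def label(k):
    """Label of example k in [0, 1023], following the construction in the text."""
    if k % 256 == 0:
        return {0: 2, 256: 3, 512: 4, 768: 1}[k]
    return k // 256 + 1

def depth2_prediction(k):
    """Majority label of the block selected by the two most significant bits (assumed t1, t2)."""
    return k // 256 + 1

errors = sum(label(k) != depth2_prediction(k) for k in range(1024))
error_rate = errors / 1024
```

With this encoding only the examples 0, 256, 512 and 768 are misclassified, which matches the 0.39% error quoted for the depth-2 tree.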

3. Experiments 

For establishing baseline comparisons we apply BudgetRF on 4 real world benchmark datasets. The first one has varying feature acquisition costs in terms of computation time and the purpose is to show our algorithm can achieve high accuracy during prediction while saving a massive amount of feature acquisition time. The other 3 datasets do not have explicit feature costs; instead, we assign a unit cost to each feature uniformly. The purpose is to demonstrate our algorithm can achieve low test error using only a small fraction of features. Note our algorithm is adaptive, meaning it acquires different features for different examples during testing. So the feature costs in the plots should be understood as an average of costs for all test examples. We use CSTC (Xu et al., 2013) and ASTC (Kusner et al., 2014) for comparison because they have been shown to have state-of-the-art cost-error performance. For comparison purposes we use the same configuration of training/validation/test splits as in ASTC/CSTC. The algorithm parameters for ASTC are set using the same configuration as in (Kusner et al., 2014). We report values for CSTC from (Kusner et al., 2014). In all our experiments we use the threshold-Pairs (11) as impurity function. We use stumps as the family of classifiers $\mathcal{G}_t$ for all features $t$. The optimization of classifiers in line 12 of Algorithm 1 is approximated by randomly generating 80, 40 and 20 stumps if the number of examples exceeds 2000, exceeds 500 and is less than 500, respectively, and selecting the best among them. All results from our algorithm were obtained by taking an average of 10 runs and standard deviations are reported using error bars.

[Figure 4 panels: (a) Yahoo! Rank, (b) MiniBooNE, (c) Forest Covertype, (d) CIFAR-10]

Figure 4. Comparison of BudgetRF against ASTC (Kusner et al., 2014) and CSTC (Xu et al., 2013) on 4 real world datasets. BudgetRF has a clear advantage over these state-of-the-art methods as it achieves high accuracy/low error using less feature cost.

Yahoo! Learning to Rank: (Chapelle et al.) We evaluate BudgetRF on a real world budgeted learning problem: the Yahoo! Learning to Rank Challenge^1. The dataset consists of 473,134 web documents and 19,944 queries.

^1 http://webscope.sandbox.yahoo.com/catalog.php?datatype=c


Given a set of training query-document pairs together with relevance ranks of documents for each query, the Challenge is to learn an algorithm which takes a new query and its set of associated documents and outputs the rank of these documents with respect to the new query. Each example $x_i$ contains 519 features of a query-document pair. Each of these features is associated with an acquisition cost in the set {1, 5, 20, 50, 100, 150, 200}, which represents the units of time required for extraction and is provided by a Yahoo! employee. The labels are binarized so that $y_i = 0$ means the document is unrelated to the query in $x_i$ whereas $y_i = 1$ means the document is relevant to the query. There are 141,397/146,769/184,968 examples in training/validation/test sets. We use Average Precision@5 as the performance metric, the same as that used in (Kusner et al., 2014). To evaluate a predicted ranking for a test query, first sort the documents in decreasing order of the predicted ranks, that is, the more relevant documents predicted by the algorithm come before those that are deemed irrelevant. Take the top 5 documents in this order and reveal their true labels. If all of the documents are indeed relevant ($y = 1$), then the precision score is increased by 1; otherwise, if the first unrelated document appears in position $1 \le j \le 5$, increase the precision score by $(j-1)/5$. Finally, the precision score is averaged over the set of test queries. We run BudgetRF using the threshold $\alpha = 0$ for the threshold-Pairs impurity function. To incorporate prediction confidence we simply run a given test example through the forest of trees to leaf nodes and aggregate the number of training examples at these leaf nodes for class 0 and 1 separately. The ratio of class 1 examples over the sum of class 1 and 0 examples gives the confidence of relevance. The comparison is shown in plot (a) of Figure 4. The precision for BudgetRF rises much faster than that of ASTC and CSTC. At an average feature cost of 70, BudgetRF already exceeds the precision that ASTC/CSTC can achieve using a feature cost of 450 and more. In this experiment the maximum number of trees we build is 140; the precision appears likely to rise even higher if we were to use more trees. BudgetRF thus represents a better ranking algorithm requiring much less wait time for users of the search engine.
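The evaluation metric and the leaf-count confidence described above can be sketched in a few lines of Python. This is only an illustration: the function names, the input layout (one `(n_class0, n_class1)` pair per tree), and the partial credit of `(j - 1) / 5` when the first unrelated document sits at position `j` are assumptions made for this sketch rather than details confirmed by the text.

```python
from typing import Sequence, Tuple


def precision_at_k(scores: Sequence[float], labels: Sequence[int], k: int = 5) -> float:
    """Score a single query.

    Returns 1.0 if the k highest-scored documents are all relevant,
    otherwise (j - 1) / k where j is the 1-based position of the first
    unrelated document (assumed form of the partial credit).
    """
    # Indices of the top-k documents, highest predicted score first.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    for position, idx in enumerate(top, start=1):
        if labels[idx] == 0:
            return (position - 1) / k
    return 1.0


def relevance_confidence(leaf_counts: Sequence[Tuple[int, int]]) -> float:
    """Aggregate training counts at the leaves reached in each tree.

    leaf_counts holds one (n_class0, n_class1) pair per tree; the
    confidence is the share of class-1 examples among all of them.
    """
    n0 = sum(c0 for c0, _ in leaf_counts)
    n1 = sum(c1 for _, c1 in leaf_counts)
    return n1 / (n0 + n1)
```

The per-query scores would then be averaged over all test queries, and `relevance_confidence` would supply the score used to sort the documents of each query.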

MiniBooNE Particle Identification Data Set: (Frank & Asuncion) The MiniBooNE data set is a binary classification task, with the goal of distinguishing electron neutrinos (signal) from muon neutrinos (background). Each data point consists of 50 experimental particle identification variables (features). There are 45,523/19,510/65,031 examples in training/validation/test sets. We apply BudgetRF with a set of 10 values of $\alpha$ = [0, 2, 4, 6, 8, 10, 15, 25, 35, 45]. For each $\alpha$ we build a forest of at most 40 trees using BudgetRF. Each point on the BudgetRF curve in (b) of Figure 4 corresponds to an $\alpha$ setting and the number of trees that meet the budget level. The final $\alpha$ is chosen using the validation set. Our algorithm clearly achieves lower test error than both ASTC and CSTC at every budget level. Indeed, using just about 6 features on average out of 50, BudgetRF achieves lower test error than what can be achieved by ASTC or CSTC using any number of features.
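One plausible way to read the selection procedure above is sketched below in Python: for each budget level, keep the ($\alpha$, number of trees) pairs whose average validation cost fits the budget and pick the one with the lowest validation error. The `Setting` record, its field names, and the numbers in the usage note are hypothetical and serve only to illustrate the idea.

```python
from typing import NamedTuple, Optional, Sequence


class Setting(NamedTuple):
    alpha: float      # threshold of the threshold-Pairs impurity
    n_trees: int      # number of trees kept in the forest
    avg_cost: float   # average feature cost per validation example
    val_error: float  # error measured on the validation set


def select_setting(settings: Sequence[Setting], budget: float) -> Optional[Setting]:
    """Return the lowest-validation-error setting that meets the budget.

    Returns None when no candidate is cheap enough.
    """
    feasible = [s for s in settings if s.avg_cost <= budget]
    if not feasible:
        return None
    return min(feasible, key=lambda s: s.val_error)
```

For example, with the made-up candidates `Setting(0, 2, 9.0, 0.10)`, `Setting(8, 5, 6.0, 0.12)` and `Setting(25, 9, 5.5, 0.15)`, a budget of 6 would select the second one, while a budget of 10 would select the first.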

Forest Covertype Data Set: (Frank & Asuncion) The Forest data set contains cartographic variables to predict 7 forest cover types. Each example contains 54 (10 continuous and 44 binary) features. There are 36,603/15,688/58,101 examples in training/validation/test sets. We use the same $\alpha$ values as in MiniBooNE. The final $\alpha$ is chosen using the validation set. In (c) of Figure 4, ASTC and CSTC struggle to decrease test error even at a high feature budget, whereas the test error of BudgetRF decreases rapidly as more features are acquired. We believe this dramatic performance difference is partly due to the distinct advantage of BudgetRF in handling mixed continuous and discrete (categorical) data where the optimal decision function is highly non-linear.

CIFAR-10: (Krizhevsky, 2009) The CIFAR-10 data set consists of 32x32 colour images in 10 classes. 400 features for each image are extracted using the technique described in (Coates & Ng, 2011). The data are binarized by combining the first 5 classes into one class and the others into the second class. There are 19,761/8,468/10,000 examples in training/validation/test sets. As shown in (d) of Figure 4, BudgetRF initially has higher test error than ASTC when the budget is low; from a budget of about 90 onward BudgetRF outperforms ASTC, and it outperforms CSTC on the entire curve. An important trend we see is that the errors for both ASTC and CSTC start to increase after some budget level. This indicates an issue of overfitting with these methods. We do not see such an issue with BudgetRF.

As a general comment, we observe that in low-cost regions using a higher $\alpha$ achieves lower test error, whereas setting $\alpha = 0$ leads to low test error at a higher cost. This is consistent with our intuition that setting a high value for $\alpha$ terminates the tree building process early and thus saves on cost; as a consequence, more trees can be built within the budget. But as the budget increases and more and more trees are added to the forest, the prediction power does not grow as fast as it does when $\alpha$ is set to low values, because the individual trees are not as powerful.

Comments on standard Random Forest: Cost is not incorporated in the standard random forest (RF) algorithm. One issue that arises is how to incorporate the budget constraint. Our strategy was to limit the number of trees in the RF to control the cost. But this does not work well even if the acquisition costs are uniform for all features. We implemented a Matlab version of RF with the default settings on the Forest, MiniBooNE and CIFAR datasets: the fraction of input data to sample with replacement for growing each new tree is 1; the number of variables to select at random for each decision split is set to 8; the minimum number of observations per tree leaf is 1. Compared to our BudgetRF algorithm using threshold-Pairs impurity with $\alpha = 0$, the feature cost for RF is much higher, as shown in Table 1. For example, in the Forest experiment, after building 10 trees, RF uses 63.04% of the total number of features for an average test example whereas BudgetRF uses only 23.21%. In terms of test error, BudgetRF achieves 0.1364, 0.0786 and 0.3600 for Forest, MiniBooNE and CIFAR respectively using 10 trees, quite competitive with the 0.1318, 0.0803 and 0.3594 obtained by RF.² For the Yahoo! Rank dataset, RF does even worse because some features have very high cost and yet RF still uses them just like the less expensive features, resulting in high cost.


² Average over 10 repeated runs of RF and BudgetRF.

Num. trees           1      2      3      4      5      6      7      8      9     10
Forest  RF       22.57  40.68  47.98  52.40  55.18  57.05  58.35  59.82  61.45  63.04
Forest  BudgetRF 11.37  15.55  17.96  19.40  20.47  21.21  21.85  22.37  22.83  23.21
MiniB   RF       26.86  42.92  54.59  63.74  70.24  75.15  78.84  81.45  83.27  85.73
MiniB   BudgetRF 16.40  25.76  32.47  37.74  41.80  45.66  49.07  52.28  55.40  57.80
CIFAR   RF        3.86   7.49  10.98  14.24  17.38  20.34  22.84  25.43  28.02  30.51
CIFAR   BudgetRF  2.62   5.14   7.48   9.65  11.78  13.81  15.76  17.59  19.38  21.09

Table 1. Percentage of average number of features used for different number of trees on Forest Covertype, MiniBooNE and CIFAR-10 datasets. BudgetRF uses much fewer features compared to RF.
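The percentages in Table 1 can be understood through the following Python sketch. It assumes that a feature acquired for one tree can be reused by every other tree at no extra cost, so the features used by a test example are the union of the features along its root-to-leaf paths; the function name and the input layout are hypothetical.

```python
from typing import Sequence, Set


def average_feature_usage(paths_per_example: Sequence[Sequence[Set[int]]],
                          n_features: int) -> float:
    """Average percentage of features acquired per example.

    paths_per_example[e][t] is the set of feature indices tested on the
    path that example e follows in tree t. A feature counts once per
    example even if several trees test it (assumed reuse at no cost).
    """
    fractions = []
    for tree_paths in paths_per_example:
        used = set().union(*tree_paths)  # union over all trees
        fractions.append(len(used) / n_features)
    return 100.0 * sum(fractions) / len(fractions)
```

Under this reading, adding a tree raises the usage only by the features that earlier trees have not already acquired, which is why the rows of Table 1 grow more slowly as the number of trees increases.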

4. Conclusion and Future Work 

We propose a novel algorithm to solve the budgeted learning problem. Our approach is to build a random forest of low cost trees with theoretical guarantees. We demonstrate that the performance of our algorithm far exceeds that of state-of-the-art algorithms on 4 real-world benchmark datasets. While we have explored the greedy algorithm based on minimax-splits, a similar algorithm can be proposed based on expected-splits. An interesting direction for future work is to examine the theoretical and empirical properties of such algorithms.

Appendix

Proof of Lemma 2.3. Before showing admissibility of the threshold-Pairs function in the multiclass setting, we first show that $F_\alpha(G)$ is admissible in the binary setting. Consider the binary classification setting and let

$$F_\alpha(G) = \big[[n_G^1 - \alpha]_+ [n_G^2 - \alpha]_+ - \alpha^2\big]_+ .$$

All the properties are obviously true except supermodularity. To show supermodularity, suppose $R \subset G$ and object $j \notin R$. Suppose $j$ belongs to the first class. We need to show

$$F_\alpha(G \cup j) - F_\alpha(G) \ge F_\alpha(R \cup j) - F_\alpha(R). \qquad (12)$$

Consider 4 cases:

(1) $F_\alpha(R) = F_\alpha(R \cup j) = 0$: The right hand side of (12) is 0 and (12) holds because of monotonicity of $F_\alpha$.

(2) $F_\alpha(R) = 0$, $F_\alpha(R \cup j) > 0$, $F_\alpha(G) = 0$: (12) reduces to $F_\alpha(G \cup j) \ge F_\alpha(R \cup j)$, which is true by monotonicity.

(3) $F_\alpha(R) = 0$, $F_\alpha(R \cup j) > 0$, $F_\alpha(G) > 0$: Note that $F_\alpha(G) > 0$ implies that $[n_G^1 - \alpha]_+ [n_G^2 - \alpha]_+ - \alpha^2 > 0$, which further implies $n_G^1 > \alpha$, $n_G^2 > \alpha$. Thus the left hand side is

$$F_\alpha(G \cup j) - F_\alpha(G) = (n_G^1 - \alpha + 1)(n_G^2 - \alpha) - \alpha^2 - \big((n_G^1 - \alpha)(n_G^2 - \alpha) - \alpha^2\big) = n_G^2 - \alpha.$$

The right hand side is

$$F_\alpha(R \cup j) = (n_R^1 - \alpha + 1)(n_R^2 - \alpha) - \alpha^2 = (n_R^1 - \alpha)(n_R^2 - \alpha) - \alpha^2 + (n_R^2 - \alpha).$$

If $n_R^1 \ge \alpha$, $F_\alpha(R) = \max\big((n_R^1 - \alpha)(n_R^2 - \alpha) - \alpha^2, 0\big) = 0$ because $F_\alpha(R \cup j) > 0$ implies $n_R^2 > \alpha$. So $F_\alpha(R \cup j) \le n_R^2 - \alpha \le n_G^2 - \alpha = F_\alpha(G \cup j) - F_\alpha(G)$.

(4) $F_\alpha(R) > 0$: We have

$$F_\alpha(G \cup j) - F_\alpha(G) = n_G^2 - \alpha \ge n_R^2 - \alpha = F_\alpha(R \cup j) - F_\alpha(R).$$

This completes the proof for the binary classification setting. To generalize to the multiclass threshold-Pairs function, again, all properties are obviously true except supermodularity, which follows from the fact that each term in the sum is supermodular according to the proof for the binary setting.
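As a sanity check on the binary case, inequality (12) can be verified numerically on small class counts. The Python sketch below is not part of the proof; it assumes the binary threshold-Pairs form stated at the start of the proof, integer $\alpha$, and uses the fact that the function depends on a set only through its class counts, so $R \subset G$ becomes a componentwise inequality on counts. The function names are hypothetical.

```python
from itertools import product


def pos(x: int) -> int:
    """Positive part [x]_+."""
    return max(x, 0)


def threshold_pairs(n1: int, n2: int, alpha: int) -> int:
    """Binary threshold-Pairs impurity on class counts (n1, n2)."""
    return pos(pos(n1 - alpha) * pos(n2 - alpha) - alpha ** 2)


def check_supermodular(max_count: int, alpha: int) -> bool:
    """Brute-force check of (12) when the added object is in class 1."""
    for g1, g2 in product(range(max_count + 1), repeat=2):
        gain_g = threshold_pairs(g1 + 1, g2, alpha) - threshold_pairs(g1, g2, alpha)
        # Every count vector (r1, r2) <= (g1, g2) corresponds to some R within G.
        for r1, r2 in product(range(g1 + 1), range(g2 + 1)):
            gain_r = threshold_pairs(r1 + 1, r2, alpha) - threshold_pairs(r1, r2, alpha)
            if gain_g < gain_r:
                return False
    return True
```

The case of an object in the second class follows by swapping the roles of the two counts, since the function is symmetric in them.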


More Admissible Impurity Functions. The following polynomial impurity function is also admissible.

Lemma 4.1 Suppose there are $k$ classes in $G$. Any polynomial function of $n_G^1, \ldots, n_G^k$ with non-negative terms such that $n_G^1, \ldots, n_G^k$ do not appear as singleton terms is admissible. Formally, if

$$F(G) = \sum_{i=1}^{M} \gamma_i (n_G^1)^{p_{i1}} \cdots (n_G^k)^{p_{ik}}, \qquad (13)$$

where the $\gamma_i$'s are non-negative, the $p_{ij}$'s are non-negative integers and for each $i$ there exist at least 2 non-zero $p_{ij}$'s, then $F$ is admissible.

Proof. Properties (1), (2), (3) and (5) are obviously true. To show $F$ is supermodular, suppose $R \subset G$, object $j \notin R$ and $j$ belongs to class $j$. We have

$$\begin{aligned}
F(R \cup j) - F(R)
&= \sum_{i \in I_j} \gamma_i \Big[ (n_R^1)^{p_{i1}} \cdots (n_R^j + 1)^{p_{ij}} \cdots (n_R^k)^{p_{ik}} - (n_R^1)^{p_{i1}} \cdots (n_R^j)^{p_{ij}} \cdots (n_R^k)^{p_{ik}} \Big] \\
&\le \sum_{i \in I_j} \gamma_i \Big[ (n_G^1)^{p_{i1}} \cdots (n_G^j + 1)^{p_{ij}} \cdots (n_G^k)^{p_{ik}} - (n_G^1)^{p_{i1}} \cdots (n_G^j)^{p_{ij}} \cdots (n_G^k)^{p_{ik}} \Big] \\
&= F(G \cup j) - F(G),
\end{aligned}$$

where the first summation index set $I_j$ is the set of terms that involve $n_R^j$. The inequality follows because $(n_R^j + 1)^{p_{ij}}$ can be expanded so the negative term can be canceled, leaving a sum-of-products form for $R$, which is term-by-term dominated by that of $G$.


Another family of admissible impurity functions is the Powers function.

Corollary 4.2 The Powers function

$$F(G) = \Big(\sum_{i=1}^{k} n_G^i\Big)^{l} - \sum_{i=1}^{k} (n_G^i)^{l} \qquad (14)$$

is admissible for $l = 2, 3, \ldots$.
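To make the Powers function concrete, the short Python sketch below evaluates it on class counts. It assumes that (14) has the form "power of the total count minus the sum of powers of the per-class counts", which is a reconstruction consistent with Lemma 4.1 (only cross terms between classes survive) rather than a form confirmed by the extracted text; the function name is hypothetical.

```python
from typing import Sequence


def powers_impurity(counts: Sequence[int], l: int) -> int:
    """Assumed Powers impurity: (sum_i n_i)^l - sum_i n_i^l."""
    return sum(counts) ** l - sum(n ** l for n in counts)
```

Under this assumption, a set containing a single class has zero impurity for every `l`, and for `l = 2` in the binary case the function equals twice the product of the two class counts, that is, the number of ordered pairs of objects with different labels.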


We compare the threshold-Pairs function with various $\alpha$ values against the Powers function to study their effect on the tree building subroutine GreedyTree. We compare performance using 9 data sets from the UCI Repository in Figure 5. We assume that all features have a uniform cost. For each data set, we replace non-unique objects with a single instance using the most common label for the objects, allowing every data set to be complete (perfectly classified by the decision trees). Additionally, continuous features are transformed to discrete features by quantizing to 10 uniformly spaced levels. For trees with a smaller cost (and therefore lower depth), the threshold-Pairs impurity function with early stopping outperforms the Powers impurity function (higher $\alpha$ leads to earlier stopping), whereas for larger cost (and greater depth), the Powers impurity function outperforms threshold-Pairs. If $\alpha$ is set to 0, the difference between the threshold-Pairs and Powers functions is small.


Details of Data Sets. The house votes data set is composed of the voting records for 435 members of the U.S. House of Representatives (342 unique voting records) on 16 measures, with a goal of identifying the party of each member. The sonar data set contains 208 sonar signatures, each composed of energy levels (quantized to 10 levels) in 60 different frequency bands, with a goal of identifying [...]. The ionosphere data set has 351 (350 unique) radar returns, each composed of 34 responses (quantized to 10 levels), with a goal of identifying if an event represents a free electron in the ionosphere. The Statlog DNA data set is composed of 3186 (3001 unique) DNA sequences with 180 features, with a goal of predicting whether the sequence represents a boundary of DNA to be spliced in or out. The Boston housing data set contains 13 attributes (quantized to 10 levels) pertaining to 506 (469 unique) different neighborhoods around Boston, with a goal of predicting the quartile in which the median income of the neighborhood falls. The soybean data set is composed of 307 examples (303 unique) with 34 categorical features, with a goal of predicting which of 19 diseases is afflicting the soybean plant. The pima data set is composed of 8 features (with continuous features quantized to 10 levels) corresponding to medical information and tests for 768 patients (753 unique feature patterns), with a goal of diagnosing diabetes. The Wisconsin breast cancer data set contains 30 features corresponding to properties of a cell nucleus for 569 samples, with a goal of identifying if the cell is malignant or benign. The mammography data set contains 6 features from mammography scans (with age quantized into 10 bins) for 830 patients, with a goal of classifying the lesions as malignant or benign.

Details of Computation in Figure 1. If $\alpha = 0$, we can compute the impurity of each set of interest: $F_0(G) = 30 \times 30 = 900$, $F_0(G_{t_1}^1) = 30 \times 10 = 300$, $F_0(G_{t_1}^2) = 0$, $F_0(G_{t_2}^1) = F_0(G_{t_2}^2) = 15 \times 15 = 225$; according to subroutine GreedyTree, we can compute $R(t_1) = \max\{\frac{1}{900 - 300}, \frac{1}{900 - 0}\} = \frac{1}{600}$ and $R(t_2) = \max\{\frac{1}{900 - 225}, \frac{1}{900 - 225}\} = \frac{1}{675}$, so $t_2$ will be chosen. On the other hand, the impurities for the threshold-Pairs with $\alpha = 8$ are $F_8(G) = 22 \times 22 = 484$, $F_8(G_{t_1}^1) = 22 \times 2 = 44$, $F_8(G_{t_1}^2) = 0$, $F_8(G_{t_2}^1) = F_8(G_{t_2}^2) = 7 \times 7 = 49$; again we can compute $R(t_1) = \max\{\frac{1}{484 - 44}, \frac{1}{484 - 0}\} = \frac{1}{440}$ and $R(t_2) = \max\{\frac{1}{484 - 49}, \frac{1}{484 - 49}\} = \frac{1}{435}$, so $t_1$ will be chosen. The above example shows that setting $\alpha = 0$ has a stronger preference for balanced splits and may in some cases lead to a poor classification result.

Figure 5. Comparison of classification error vs. max-cost for the Powers impurity function in (14) for $l = 2, 3, 4, 5$ and the threshold-Pairs impurity function. Note that for both House Votes and WBCD, the depth 0 tree is not included as the error decreases dramatically using a single test. In many cases, the threshold-Pairs impurity function outperforms the Powers impurity functions for trees with smaller max-costs, whereas the Powers impurity function outperforms the threshold-Pairs function for larger max-costs.
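The arithmetic of this example can be replayed with the Python sketch below. It takes the impurity values quoted in the text as given inputs and assumes that GreedyTree scores a test $t$ by the worst-case ratio of its cost to the impurity reduction over its outcomes, with unit cost for both tests; the ratio form, the unit costs, and the function name are assumptions made for illustration.

```python
from fractions import Fraction
from typing import Sequence


def split_ratio(parent: int, children: Sequence[int], cost: int = 1) -> Fraction:
    """Worst-case cost per unit of impurity reduction over the outcomes.

    Assumes every child impurity is strictly below the parent impurity;
    otherwise the reduction is zero and the ratio is undefined.
    """
    return max(Fraction(cost, parent - child) for child in children)


# Impurity values quoted in the text for alpha = 0 and alpha = 8.
r0_t1 = split_ratio(900, [300, 0])
r0_t2 = split_ratio(900, [225, 225])
r8_t1 = split_ratio(484, [44, 0])
r8_t2 = split_ratio(484, [49, 49])

# The test with the smaller ratio is the one selected.
chosen_alpha0 = "t1" if r0_t1 < r0_t2 else "t2"
chosen_alpha8 = "t1" if r8_t1 < r8_t2 else "t2"
```

With these inputs the balanced test $t_2$ has the smaller ratio for $\alpha = 0$, while the unbalanced test $t_1$ has the smaller ratio for $\alpha = 8$, which matches the conclusion drawn in the text.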